# Multilingual visual understanding
## Qwen2.5 VL 72B Instruct GGUF
A multimodal large model from Tongyi Qianwen (Qwen) that accepts image and text input, generates text, and handles 128K long-context processing, with multilingual capabilities.

Task: Image-to-Text · Language: English · License: Other · Publisher: lmstudio-community · Downloads: 668 · Likes: 1
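Because this build ships as GGUF for local runtimes, one common way to query it is through LM Studio's OpenAI-compatible local server. The sketch below assumes the server's default address (http://localhost:1234/v1), a placeholder image file, and an assumed model identifier; all three depend on the local setup.

```python
# Minimal sketch: query a vision model served by LM Studio's local
# OpenAI-compatible endpoint. Model name and image path are placeholders.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

with open("photo.jpg", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2.5-vl-72b-instruct",  # assumed id; use the name LM Studio shows
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=256,
)
print(response.choices[0].message.content)
```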
## Aya Vision 32B
Aya Vision 32B is an open-weight 32B-parameter multimodal model developed by Cohere Labs, supporting vision-language tasks in 23 languages.

Task: Image-to-Text · Library: Transformers · Languages: multiple · Publisher: CohereLabs · Downloads: 387 · Likes: 193
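A minimal sketch of querying Aya Vision through Hugging Face transformers follows; the repo id and the chat-message layout are assumptions based on common transformers conventions, so verify them against the model card. The same pattern covers Aya Vision 8B below by swapping the repo id.

```python
# Minimal sketch: image-text-to-text generation with Aya Vision via transformers.
# Repo id and message layout are assumptions; check the model card.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "CohereLabs/aya-vision-32b"  # assumed; "CohereLabs/aya-vision-8b" for the 8B
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe this image in French."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the prompt.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```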
## Aya Vision 8B
Aya Vision 8B is an open-weight 8-billion-parameter multilingual vision-language model supporting visual and language tasks in 23 languages.

Task: Image-to-Text · Library: Transformers · Languages: multiple · Publisher: CohereLabs · Downloads: 29.94k · Likes: 282
## Llama 3.2 11B Vision Instruct Abliterated 8 Bit
A multimodal model based on Llama-3.2-11B-Vision-Instruct that supports image and text input and generates text output.

Task: Image-to-Text · Library: Transformers · Languages: multiple · Publisher: mlx-community · Downloads: 128 · Likes: 0
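Since this quantization is published by mlx-community, a natural runtime is the mlx-vlm package on Apple silicon. The sketch below follows mlx-vlm's documented load/apply_chat_template/generate pattern, but the exact signatures have changed across versions and the repo id is an assumption, so treat it as illustrative.

```python
# Minimal sketch: run an MLX-quantized vision model with mlx-vlm on Apple silicon.
# Repo id and call signatures are assumptions; mlx-vlm's API varies by version.
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_id = "mlx-community/Llama-3.2-11B-Vision-Instruct-abliterated-8bit"  # assumed id
model, processor = load(model_id)
config = load_config(model_id)

images = ["photo.jpg"]  # hypothetical local image
prompt = apply_chat_template(processor, config, "Describe this image.",
                             num_images=len(images))
output = generate(model, processor, prompt, images, verbose=False)
print(output)
```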
## Pix2Struct Screen2words Base
Pix2Struct is a vision-language understanding model; this checkpoint is optimized for generating functional description captions from UI screenshots.

Task: Image-to-Text · Library: Transformers · Languages: multiple · Publisher: google · License: Apache-2.0 · Downloads: 262 · Likes: 24
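A short sketch of captioning a UI screenshot with this checkpoint through the standard transformers Pix2Struct classes; the screenshot path is a placeholder.

```python
# Minimal sketch: summarize a UI screenshot with Pix2Struct (screen2words).
from PIL import Image
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor

model_id = "google/pix2struct-screen2words-base"
processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

screenshot = Image.open("app_screen.png")  # hypothetical UI screenshot
inputs = processor(images=screenshot, return_tensors="pt")

# The screen2words checkpoint needs no text prompt: it captions the screen directly.
generated = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(generated[0], skip_special_tokens=True))
```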